Oncogenic CALR mutant C-terminus mediates dual binding to the thrombopoietin receptor triggering complex dimerization and activation

Calreticulin (CALR) frameshift mutations are the second most frequent cause of myeloproliferative neoplasms (MPN). In healthy cells, CALR transiently and non-specifically interacts with immature N-glycosylated proteins through its N-terminal domain. Conversely, CALR frameshift mutants turn into rogue cytokines by stably and specifically interacting with the Thrombopoietin Receptor (TpoR), inducing its constitutive activation. Here, we identify the basis of the acquired specificity of CALR mutants for TpoR and define the mechanisms by which complex formation triggers TpoR dimerization and activation. Our work reveals that the CALR mutant C-terminus unmasks the CALR N-terminal domain, rendering it more accessible to bind immature N-glycans on TpoR. We further find that the basic mutant C-terminus is partially α-helical and define how its α-helical segment concomitantly binds acidic patches of the TpoR extracellular domain and induces dimerization of both CALR mutant and TpoR. Finally, we propose a model of the tetrameric TpoR-CALR mutant complex and identify potentially targetable sites.


Supplementary Figures
Supplementary Figure 1. CALR species used in this study. […] coverage and the number of peptides detected are indicated. c-d. Wood's plots linked to Figure 1C-D generated with Deuteros 2.0 2. Each bar (wood) represents the H-D exchange differential for a single peptide between CALR WT and CALR ∆C-tail (c) or between CALR WT and CALR del52 (d) at 0, 0.25, 1, 5, 15 or 60 minutes of incubation in deuterium. Peptides in red (deprotected) or blue (protected) have significant differential H-D exchange (p<0.001) with the peptide-level significance testing (n = 3) as described 2. The N-, P- and C-domains of CALR are indicated on the plots by letters N, P and C, respectively. Source data are provided as a Source data file.
[…] the PC1-PC2 space performed on the 1720-1480 cm⁻¹ spectral region. Each star stands for one spectrum. For the sake of clarity, a colour is associated with each sample but the analysis is completely unsupervised. Percentages on the axis labels indicate the variance described by PC1 (59.7%) and PC2 (33.8%). A mean centering (subtraction of the arithmetic mean from all the spectra) was applied to this set of data. c. (a) Representation of the original data in terms of the two axes x and y. Each point represents one IR spectrum. (b) As a result of the PCA, the axes are rotated and the data are represented in the two-dimensional principal component space. Thereby PC1 represents the largest variance in the data.

Supplementary Figure 5. Raw FTIR spectra of CALR species.
a-d. Comparison of the processed mean spectra recorded. The difference between the two mean spectra indicated is shown in black, with a zoom on the spectral region related to protein absorption (1720-1480 cm⁻¹). A Student's t-test was carried out at each wavenumber with a significance level α = 0.1%. The significant spectral differences are marked with black stars on the difference spectrum. Each spectrum is identified by a unique colour indicated in the legend.

Supplementary Figure 6. CALR mutant C-terminus specifically interacts with TpoR in the absence of immature N-glycans.
a. Coomassie Blue staining of TpoR D1-D4 with mature N-glycans. Purified TpoR D1-D4 was analyzed by SDS-PAGE in denaturing and reducing conditions and stained with Coomassie Blue for total protein detection. Representative gel from 3 experiments. b. Thermal stability of TpoR D1-D4 with mature N-glycans. The graphs represent the 350/330 nm intrinsic fluorescence from Trp and Tyr residues at different temperatures. The S-shaped curve is typical of well-folded proteins as the accessibility of Tyr and Trp residues gradually increases upon temperature-induced protein unfolding. Source data are provided as a Source data file.
c. […] CALR del52 was labeled with RED-NHS 2nd Generation chemistry according to the manufacturer's instructions (NanoTemper Technologies). The curve corresponds to the mean (± SD) of two independent experiments following the fluorescence of the target protein (CALR del52-NHS) with titration of the ligand (TpoR D1-D4). While the concentration of the target is kept constant at 20 nM, the ligand concentration ranges from 5 µM to 0.15 nM. The binding curve represents the percentage of bound fraction of CALR del52 to TpoR D1-D4 and yields a KD of 104 ± 3.75 nM. Source data are provided as a Source data file. d. Sequence coverage obtained for the H-D exchange analysis between CALR del52 alone and CALR del52 in complex with TpoR D1-D4 with mature N-glycans. The percentage of sequence coverage and the number of peptides detected are indicated. e. Wood's plots linked to Figure 2A generated with Deuteros 2.0 2. Each bar (wood) represents the H-D exchange differential for a single peptide between CALR del52 alone and CALR del52 in complex with TpoR D1-D4 with mature N-glycans at 0, 0.25, 1, 5, 15 or 60 minutes of incubation in deuterium. Peptides in red (deprotected) or blue (protected) have significant differential H-D exchange (p<0.001) with the peptide-level significance testing (n = 3) as described 2. The N-, P-, WT C- and mutant C-domains of CALR are indicated on the plots by letters N, P, C and Cmut, respectively. Source data are provided as a Source data file. f. Deuterium uptake (Da) of the indicated peptides of the mutant C-terminus from CALR del52 alone or in complex with TpoR D1-D4 (with mature N-glycans) at 5 different exchange time points. The dotted lines represent the standard deviation (SD), the full line represents the average of triplicates. Source data are provided as a Source data file. g. STAT5 transcriptional activity in the presence of EpoR and indicated CALR truncations.
HEK293T cells were transiently transfected with human EpoR and CALR truncations along with cDNAs coding for STAT5, JAK2 and the SpiLuc Firefly luciferase reporter reflecting STAT5 transcriptional activity, and normalized with a control reporter (pRLTK) containing Renilla luciferase. Data represent mean ± SD (n = 6 biologically independent samples over 2 independent experiments). Data were analyzed with two-way ANOVA followed by Sidak's multiple comparisons test. ns: non-significant (p>0.05). Source data are provided as a Source data file.

Supplementary Figure 7. Donor Saturation Assay between TpoR-NanoLuc and CALR del52-HaloTag.
Donor Saturation Assay between TpoR-NanoLuc and the indicated CALR del52-HaloTag (HT) construct. HEK293T cells were co-transfected with a fixed amount of donor (TpoR-NanoLuc) and increasing ratios of acceptor (HaloTag fusion proteins). The negative control corresponds to a HaloTag protein not fused to TpoR. A specific BRET signal will increase in a hyperbolic manner before reaching a plateau. A non-specific interaction will be less intense and increase linearly without reaching a plateau. The shape of the curve in a Donor Saturation Assay provides a control for the specificity of the interaction according to the manufacturer's instructions (Promega).
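The distinction between a hyperbolic (saturable) and a linear (non-saturable) donor saturation curve can be illustrated with a minimal sketch. It is not the analysis used in this study nor the manufacturer's procedure: the model BRET = Bmax·x/(K + x), the grid search over K, the function names and the decision factor of 0.5 are all assumptions chosen for illustration.

```python
def sse(observed, predicted):
    """Sum of squared errors between two equal-length sequences."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def fit_linear(ratios, bret):
    """Least-squares line through the origin: BRET = slope * ratio."""
    slope = sum(x * y for x, y in zip(ratios, bret)) / sum(x * x for x in ratios)
    return slope, sse(bret, [slope * x for x in ratios])


def fit_hyperbola(ratios, bret):
    """Fit BRET = Bmax * x / (K + x) by grid search over K.

    For a fixed K the best Bmax has a closed-form least-squares solution.
    The grid (K from 0.01 to 1000, log-spaced) is an illustrative assumption.
    """
    best = None
    for i in range(-40, 61):
        k = 10 ** (i / 20)
        f = [x / (k + x) for x in ratios]
        bmax = sum(y * fi for y, fi in zip(bret, f)) / sum(fi * fi for fi in f)
        err = sse(bret, [bmax * fi for fi in f])
        if best is None or err < best[2]:
            best = (bmax, k, err)
    return best  # (Bmax, K, SSE)


def looks_saturable(ratios, bret, factor=0.5):
    """True if the hyperbola fits clearly better than a straight line.

    `factor` is a hypothetical decision threshold, not a published criterion.
    """
    _, linear_err = fit_linear(ratios, bret)
    _, _, hyperbola_err = fit_hyperbola(ratios, bret)
    return hyperbola_err < factor * linear_err
```

With synthetic acceptor/donor ratios, a curve generated from a hyperbola is classified as saturable, whereas a strictly proportional signal is not. Real data would also require replicate handling and background subtraction, which this sketch omits.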

Supplementary Figure 8. Interaction between CALR mutant and TpoR in the presence of immature N-glycans.
a. Coomassie Blue staining of the complex CALR del52-TpoR D1D2 with immature N-glycans on Asn117. The purified complex was analyzed by SDS-PAGE in denaturing and reducing conditions and stained with Coomassie Blue for total protein detection. Representative gel from 3 experiments. b. Thermal stability of the CALR del52-TpoR D1D2 complex. The graphs represent the 350/330 nm intrinsic fluorescence from Trp and Tyr residues at different temperatures. The S-shaped curve is typical of well-folded proteins as the accessibility of Tyr and Trp residues gradually increases upon temperature-induced protein unfolding. Source data are provided as a Source data file. c. Sequence coverage obtained for the H-D exchange analysis between CALR del52 alone and the CALR del52-TpoR D1D2 complex with immature N-glycans on TpoR Asn117. The percentage of sequence coverage and the number of peptides detected are indicated. d. Wood's plots linked to Figure 3E

Production and purification of recombinant proteins
Recombinant human CALR wild-type, CALR del52 and its derivatives contain an N-terminal His tag sequence (MGSHHHHHHGSSG) that replaces the CALR signal peptide sequence (a.a. 1-17). In addition, cysteine 163 was mutated to serine. The amino acid sequence of human TpoR D1D2D3D4 (TpoR D1-D4) starts at Q26 and ends at T489, and that of human TpoR D1D2 (TpoR D1D2) starts at Q26 and ends at Q290. Both contain a histidine tag at the C-terminus. The amino acid sequence of human CALR WT starts at E18 and ends at L417.
The sequences of all recombinant proteins are provided below. […] Figures 1 and 2). The correct folding of the proteins was validated by thermal unfolding experiments with Tycho NT.6 (Supplementary Figure 1). To obtain a meaningful comparison, five FTIR spectra were recorded for each sample described above (after buffer exchange). FTIR spectra (raw data, without any pre-processing step) are provided in Supplementary Figure 4a.

Result analysis
To interpret the FTIR data in terms of secondary structure, two types of analysis were performed:
• Comparison of the spectra in the region specific to protein absorption to evaluate the similarity between the wild-type protein and the two mutated forms;
• A prediction of the secondary structure content using an in-house database of FTIR spectra of proteins.
The CALR del52 A394* and the CALR del52 separate from the CALR WT along the first principal component (PC1). The CALR del52 A394* is the closest to the CALR WT. The two other mutants separate from the CALR WT along the PC1 and the PC2.

Secondary structure prediction
As described in the methods section, the estimation was performed using three wavenumbers in the Amide I and II bands. The wavenumbers used for this secondary structure determination are the following: […]
The table shown in main Figure 1f presents the results of this prediction for each sample.
According to the prediction obtained, samples CALR WT and CALR del52 A394* have similar α-helix, β-sheet, turn and random structure contents. Sample CALR del52 has a lower α-helix content and a slightly higher random structure content. The α-helix content further decreases for sample CALR del52 D135L/Y109F and becomes null for CALR ∆C-tail. In addition, the CALR del52 D135L/Y109F has a higher turn content and the CALR ∆C-tail has higher turn and random contents. A slight increase of the β-sheet content is also observed for the CALR ∆C-tail.
For the present predictions, the standard error of prediction in cross-validation obtained using the 50-protein database is 5.7% for the α-helix and 6.7% for the β-sheet […]
[…] GmbH, Ettlingen, Germany). The FTIR spectrometer was equipped with a mercury-cadmium-telluride detector, which was cooled with liquid nitrogen. The spectra were recorded in ATR mode using a Golden Gate™ ATR accessory (Specac, Orpington, United Kingdom) with an integrated total reflection element composed of a single-reflection diamond. The angle of incidence was 45 degrees.

FTIR measurement
0.5 μL of sample was loaded on the diamond crystal of the ATR device of the FTIR spectrometer and quickly dried with a constant, gentle nitrogen flow: elimination of the water molecules prevents overlapping of the large water absorption peaks with the sample's absorption spectrum. After each spectrum, the crystal was cleaned with water. A background was recorded with a clean crystal before the start of the measurement and before every new sample. FTIR spectra were recorded between 4000 and 600 cm⁻¹ at a resolution of 2 cm⁻¹. Each spectrum was obtained by taking an average of 128 scans. The FTIR measurements were carried out at room temperature (~22 °C). For each sample, at least four spectra were recorded.

Multivariate data analysis
Each wavenumber in an IR spectrum is considered as a variable. There are therefore a few thousand wavenumbers at which biological molecules absorb, and several spectra […] x and y will be rotated and the principal components form the new axes.
In fact, there are as many principal components as variables in the data. However, the first few principal components represent generally over 99% of the present variance in the data. Thus, PCA permits reducing the dimensionality of the spectral data while retaining the majority of the information. This is simply done by projecting the spectra in the principal components space.
The representation of the composition of all spectra in terms of the PCs is called a score plot. Each point or star in a score plot represents a spectrum. Thus, a score plot permits visualising similarities and differences between spectra and determining whether the spectra are related to each other by forming groups 5.
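The rotation of the axes and the resulting score plot can be illustrated on a two-variable toy data set, where PCA has a closed-form solution. This is only a sketch under simplifying assumptions: real spectra contain thousands of variables and would require an eigen-decomposition or SVD of the full covariance matrix, and the function name and return values are hypothetical.

```python
import math


def pca_2d(points):
    """Closed-form PCA for two variables (x, y).

    Returns the rotation angle of PC1 (radians), the scores of every
    point in the (PC1, PC2) space, and the fraction of variance
    described by each component.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Mean centering: subtract the arithmetic mean from all observations.
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the direction of largest variance (PC1).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    # Scores: coordinates of each point along the rotated axes.
    scores = [(x * c + y * s, -x * s + y * c) for x, y in centered]

    # Eigenvalues of the covariance matrix = variance along PC1 and PC2.
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    variances = (half_trace + spread, half_trace - spread)
    explained = tuple(v / (sxx + syy) for v in variances)
    return theta, scores, explained
```

For points scattered around the diagonal y = x, PC1 comes out close to 45° and carries almost all of the variance, so plotting the first score of each point already separates the observations.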

Secondary structure prediction
Using a database of 50 proteins containing as little fold redundancy as possible, an ascending stepwise method was applied to determine the protein secondary structure.
It was demonstrated that three wavenumbers contain all the non-redundant information related to the secondary structure content. The standard error of prediction in cross-validation obtained using the 50-protein database was 5.7% for the α-helix, 6.7% for the β-sheet, 3.2% for turns and 8% for random 6,7.

Statistical analysis
Supplementary Figure 5 presents the mean preprocessed spectra for each sample, with a zoom on the spectral region related to protein absorption. A statistical pairwise comparison with the wild type was also performed to evidence spectral changes. The difference spectrum (black spectrum) corresponds to the difference between the mean spectra of each sample. The black stars on the difference spectrum refer to significant differences defined by a Student's t-test at each wavenumber. Details on the statistical techniques are provided here below.

Student's t-test
In order to evidence spectral changes between samples, the mean spectrum of one was subtracted from the mean spectrum of another. We thus obtained a "difference spectrum". All difference spectra were calculated with fully preprocessed spectra (baseline corrected and normalized).
The Student's t-test is a parametric hypothesis test. It is used to determine whether two populations are significantly different from each other or not by comparing the means of the measurements derived from these two populations. The test is applicable if the measurements follow a normal distribution and the variance of each population is the same.
Two hypotheses are tested:
- H0: $\bar{x}_1 = \bar{x}_2$, which means that there is no difference between the means of the two populations;
- H1: $\bar{x}_1 \neq \bar{x}_2$, which means that the means of the two populations are significantly different,
where $x_i$ is the signal intensity for a given wavenumber.
The t-test statistic is calculated as follows:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$
where $\bar{x}_i$ is the mean of sample $i$, $s^2$ is the estimated common variance for the two samples and $n_i$ the number of samples in population $i$.
The variance $s^2$ is calculated with the following formula:
$$s^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}$$
where $s_i$ is the standard deviation of sample $i$ and $n_i$ the number of samples in population $i$.
The test was carried out with a significance level of α = 0.1% (p<0.001). This threshold is defined as the probability of rejecting the null hypothesis under the assumption that it is true.
Student's t-tests were computed at every wavenumber and allowed a statistical comparison between the spectra of the two samples. Wavenumbers where a significant difference occurs (with a significance α = 0.1%) are indicated by black stars.
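A wavenumber-by-wavenumber comparison of this kind can be sketched as follows, using the standard pooled-variance Student's t statistic. This is an illustration rather than the code used for the analysis: the function names are hypothetical, and the critical value is passed in by the caller instead of being computed (for example, 5.041 is the tabulated two-sided value at α = 0.1% for 8 degrees of freedom, which assumes five spectra per sample).

```python
import math
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Student's t statistic with a pooled (common) variance estimate."""
    n1, n2 = len(sample1), len(sample2)
    s2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    diff = mean(sample1) - mean(sample2)
    if s2 == 0:
        # Degenerate case: no spread at all in either sample.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(s2 * (1 / n1 + 1 / n2))


def significant_wavenumbers(spectra1, spectra2, wavenumbers, t_crit):
    """Return the wavenumbers where |t| exceeds the critical value.

    spectra1 / spectra2: lists of spectra (one list of absorbances per
    spectrum), all sampled at the same `wavenumbers`.
    t_crit: two-sided critical value supplied by the caller (assumption).
    """
    flagged = []
    for j, wavenumber in enumerate(wavenumbers):
        column1 = [spectrum[j] for spectrum in spectra1]
        column2 = [spectrum[j] for spectrum in spectra2]
        if abs(pooled_t(column1, column2)) > t_crit:
            flagged.append(wavenumber)
    return flagged
```

The flagged wavenumbers correspond to the positions that would be marked with stars on a difference spectrum. Because one test is run per wavenumber, a strict threshold such as α = 0.1% helps to limit false positives across the whole spectral region.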

Nano-Bioluminescence Energy Transfer (BRET) is a technique that measures proximity between two proteins in living cells. When the two partners are in close proximity (< 10 nm), bioluminescence energy transfer (BRET) occurs between a donor (NanoLuciferase) and an acceptor (HaloTag ligand). This technique has been used to measure protein-protein interaction in living cells between a wide variety of proteins thanks to its ease of use, reproducibility and specificity 8.

Construct
The Thrombopoietin Receptor (TpoR) was cloned into pNL-N vector (Promega) to generate the N-terminally fused NanoLuc-TpoR construct as described. Extracellular forms of the receptor (TpoR D1-D4, TpoR D1D2 and TpoR D1) were obtained from this initial construct by introducing a stop codon by site-directed mutagenesis.
The CALR del52-HaloTag construct was generated by cloning the cDNA from CALR del52 into the pHT-C vector (Promega) to generate the CALR del52-HaloTag fusion […]
[…] to be used to model the insertion and bring the TpoR-CRM1 architecture from a 3/4 β-sandwich (EpoR) to a 4/5 β-sandwich TpoR model.
In order to computationally investigate the interaction between CRM1 and CALR del52, a model of the mutant C-terminus of CALR del52 was also generated using tLEaP and the FF14SB 21 protein force field. Given the secondary structure propensity of the CALR del52 C-terminus, this region was first modelled as a straight, ~72 Å long, 12-turn α-helix. This structure was then subjected to a 1 μs MD simulation in order to gather the conformational pool used to select several starting CALR del52 mutant C-terminus structures for CALR del52 C-terminus-TpoR D1D2 docking.

Trajectory analysis
A strong bending motion was noticed in the TpoR molecule of Pose 3 (lowest binding free energy) between the two FN-III-like domains (Supplementary Figure 10c-d).
This was not observed in the other poses (Supplementary Figure 10e-f), which could indicate that the presence of the CALR del52 C-terminus has a stabilizing effect on the TpoR. This might also imply that the domain is more flexible when unbound, but that the interdomain joint stiffens when bound.

Conformational discretization:
Microstates were delimited using Time-Lagged Independent Component Analysis (TICA). The backbone dihedral angles of the CALR del52 mutant C-terminus molecule were used as input coordinates for TICA. TICA and free energy surfaces were computed using the PyEMMA (2.5.11) Python package 31, and the resulting plots were generated using the Matplotlib (3.5.1) Python package 32.
The inflection core state (InfleCS) clustering method 33 was used to cluster the two transformed coordinates with the highest eigenvalues, and the associated cluster centers were plotted on the corresponding free energy surface. Clustering was performed using 10 components, and re-estimation of the same model was done 5 times. The Bayesian information criterion was used for identifying the model. […] Figure 13d).
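As an illustration of what a free energy surface over two transformed coordinates represents, the sketch below estimates it by Boltzmann inversion of a two-dimensional histogram, F = −ln P in units of kT, shifted so that the most populated bin is at zero. It is a generic, simplified estimate and not the PyEMMA routine used here; the function name, the fixed square bins and the absence of any trajectory reweighting are assumptions.

```python
import math
from collections import Counter


def free_energy_surface(points, bin_width):
    """Histogram-based free energy (in kT) over two coordinates.

    points: iterable of (coord1, coord2) pairs, e.g. the projection of
    each trajectory frame on the two slowest independent components.
    Returns a dict mapping (bin_i, bin_j) to F - F_min. Bins that were
    never visited are simply absent (their free energy is undefined).
    """
    counts = Counter(
        (math.floor(x / bin_width), math.floor(y / bin_width)) for x, y in points
    )
    total = sum(counts.values())
    # Boltzmann inversion: F / kT = -ln(probability of the bin).
    energies = {b: -math.log(c / total) for b, c in counts.items()}
    lowest = min(energies.values())
    return {b: f - lowest for b, f in energies.items()}
```

For example, a bin visited half as often as the most populated one sits ln 2 ≈ 0.69 kT above the minimum. Cluster centers can then be overlaid on such a surface to check that they fall in low free energy basins.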

Modelling of the TpoR-CALR del52 and ins5 tetramers
Initial models from AlphaFold 2.0 1 and RosettaDock 34 were used to build the tetrameric 3D TpoR-CALR mutant models and to identify the interaction interface between the two CALR del52 and CALR ins5 mutants. H-D exchange data was used to identify contacts between TpoR and CALR del52 in the formation of the tetramer complex.
The ER-specific G1M9 glycans of TpoR in contact with the CALR mutant were modelled with Glycopack 35 and the Glycam server in the configuration consistent with NMR data 36, while the rest are of complex type, built in agreement with the SAGS Database 37,38.
The contacts identified by HDX-MS and the crosslinking data on the TM region configuration of the TpoR dimer were used as constraints in generating the overall 2CALR del52-2TpoR model. The glycoproteic tetramer was then gently optimized in a five-stage process in implicit solvent: (1) first, it was heated to 300 K over 1 ns with harmonic cartesian constraints of K=1 on all backbone atoms predicted to be found in secondary structures and K=0.5 on backbone atoms in predicted coil regions; then (2) the system was subjected to equilibration for 2 ns with harmonic cartesian constraints of K=0.5 on secondary structures and no constraints on predicted coil regions; this was followed by (3) a further equilibration of 20 ns with distance-based harmonic constraints of K=1 on interdomain contact points and hydrogen bonds in predicted secondary structure regions; then (4) the system was cooled over 2 ns with distance-based constraints in place and finally (5) extensively minimized without constraints.
This glycoproteic tetramer was then immersed into a full-atom representation of the environment, consisting of a lipid bilayer of 1162 POPC molecules accommodating the TM region of TpoR, and 263023 TIP3P water molecules, 726 chloride and 789 sodium ions describing the solvent region hydrating the rest of the tetramer, using the CHARMM-GUI server 39. This overall system consisting of ~1 million atoms was subjected to further unconstrained extensive minimization to obtain the final model for MD simulation. A similar procedure was used for preparing the system containing the TpoR-CALR ins5 tetramer.
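The quoted system size can be cross-checked with simple arithmetic. The sketch below counts only the environment (lipids, water and ions) from the numbers given in the text; the per-molecule atom counts (134 atoms for POPC, C42H82NO8P, and 3 atoms for a TIP3P water) are standard values, while the function name is hypothetical and the atom count of the glycoprotein itself is not stated and is therefore left out.

```python
ATOMS_PER_POPC = 134   # C42H82NO8P: 42 + 82 + 1 + 8 + 1 atoms
ATOMS_PER_TIP3P = 3    # three-site water model


def environment_summary(n_popc=1162, n_water=263023, n_chloride=726, n_sodium=789):
    """Atom counts of the membrane/solvent environment and net ion charge."""
    lipid_atoms = n_popc * ATOMS_PER_POPC
    water_atoms = n_water * ATOMS_PER_TIP3P
    ion_atoms = n_chloride + n_sodium
    return {
        "lipid_atoms": lipid_atoms,
        "water_atoms": water_atoms,
        "ion_atoms": ion_atoms,
        "environment_atoms": lipid_atoms + water_atoms + ion_atoms,
        # Each Na+ carries +1 and each Cl- carries -1.
        "net_ion_charge": n_sodium - n_chloride,
    }
```

The environment alone accounts for roughly 0.95 million atoms, which is consistent with a total of about one million once the tetramer and its glycans are added. The ions carry a net charge of +63; if the system was neutralized, as is common practice, this would imply a solute charge of about −63, but that is an inference and not a statement from the text.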
The explicit solvent MD simulations were performed with NAMD v.2.13 36 and the CHARMM36 37-39 force field. […] at constant pressure (1 atm) in two steps: (1) heating using a 1 fs time step in order to ensure an even energy distribution, followed by (2) constant temperature simulation using a 2 fs time step. All MD simulations used a Langevin integrator, with a coupling coefficient of 1 ps⁻¹.
Production MD simulations of the heterotetrameric TpoR-CALR del52 and ins5 systems were carried out for 100 ns in triplicate. Root-mean-square deviation (RMSD) analysis of these simulations was performed using MDAnalysis 41,42 and the corresponding plots are presented in Supplementary Figure 8 (CALR del52) and Supplementary Figure 9 (CALR ins5). For the analysis of inter-residue contacts, the first half (50 ns) of each simulation was discarded as equilibration and only the second half (which shows a plateau in the RMSD plot) was used for computing the distances.

Tetramer trajectory analysis
Inter-domain and glycan-CALR contacts were computed using MDTraj 19 and were averaged over three 100 ns runs. A threshold of 8 Å was used as the contact cutoff between heavy atoms, and the data were collected every 0.5 ns starting from 50 ns (to account for a model equilibration period), resulting in 100 frames per run. Only contacts present, on average, in more than 60% of frames are shown.
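The contact criterion can be sketched in a few lines of plain Python. This is not the MDTraj-based workflow used for the analysis: the function names and the data layout (each frame as a pair of heavy-atom coordinate lists, in Å, for the two residues or groups of interest) are assumptions made to keep the example self-contained.

```python
import math

CUTOFF_A = 8.0        # heavy-atom contact cutoff in Å (from the text)
MIN_OCCUPANCY = 0.60  # minimal average fraction of frames (from the text)


def frames_per_run(total_ns=100.0, start_ns=50.0, stride_ns=0.5):
    """Number of analysed frames when sampling every stride_ns after start_ns."""
    return round((total_ns - start_ns) / stride_ns)


def in_contact(atoms_a, atoms_b, cutoff=CUTOFF_A):
    """True if any pair of heavy atoms is closer than the cutoff."""
    return any(math.dist(a, b) < cutoff for a in atoms_a for b in atoms_b)


def occupancy(frames, cutoff=CUTOFF_A):
    """Fraction of frames of one run in which the contact is formed."""
    return sum(in_contact(a, b, cutoff) for a, b in frames) / len(frames)


def is_reported(runs, min_occupancy=MIN_OCCUPANCY):
    """Average the occupancy over independent runs and apply the threshold."""
    mean_occupancy = sum(occupancy(run) for run in runs) / len(runs)
    return mean_occupancy > min_occupancy, mean_occupancy
```

With the sampling described above, `frames_per_run()` gives 100 frames per run. A contact formed in 75%, 50% and 100% of the frames of three runs has a mean occupancy of 75% and would therefore be reported.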

Thermal unfolding and stability
Thermal unfolding experiments were conducted to measure the Tm of all proteins used […] protein integrity and Tm. The first derivative of the 350/330 ratio informs on the purity of the protein preparation. A pure sample will have a single peak, indicating that a single protein species is present in the sample, while samples that are not pure will have multiple peaks. All measurements were performed with Tycho NT.6 (NanoTemper).
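How the first derivative of the 350/330 ratio reveals the number of unfolding transitions can be illustrated with a short sketch. It is not the instrument software: the central finite-difference scheme, the function names and the relative height threshold used to ignore minor fluctuations are assumptions for illustration only.

```python
def first_derivative(temps, ratios):
    """Central finite differences of the 350/330 ratio at interior points."""
    return [
        (temps[i], (ratios[i + 1] - ratios[i - 1]) / (temps[i + 1] - temps[i - 1]))
        for i in range(1, len(temps) - 1)
    ]


def derivative_peaks(temps, ratios, min_rel_height=0.1):
    """Temperatures of local maxima of |d(ratio)/dT|.

    Peaks lower than min_rel_height times the tallest peak are ignored
    (hypothetical threshold). One peak suggests a single unfolding
    transition; its position approximates the inflection temperature.
    """
    deriv = first_derivative(temps, ratios)
    tallest = max(abs(d) for _, d in deriv)
    peaks = []
    for i in range(1, len(deriv) - 1):
        temp, value = deriv[i]
        height = abs(value)
        if (
            height > abs(deriv[i - 1][1])
            and height >= abs(deriv[i + 1][1])
            and height >= min_rel_height * tallest
        ):
            peaks.append(temp)
    return peaks
```

Applied to an ideal sigmoidal unfolding curve, the function returns a single temperature at the midpoint of the transition, whereas a mixture of two species unfolding at well-separated temperatures gives two peaks. Experimental traces are noisy, so some smoothing would normally be applied before differentiation.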